Report Generation: Adding basic online evaluation scores by lotif · Pull Request #46 · VectorInstitute/eval-agents

lotif · 2026-02-11T17:16:43Z

Summary

This PR adds basic online evaluations for the report generation agent. These evaluations are meant to be run in the "production" environment and will compute and upload scores to Langfuse for the following:

Checking if the final result is present and contains a string match
Adding scores for latency, token count and cost
- Those have been added to langfuse.py so they can be easily reused by other agents
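As a rough illustration of the first check above, here is a minimal sketch of a final-result score. The `Score` container, the function name `score_final_result`, and the score name `"final_result"` are hypothetical and are not taken from this PR; the sketch only shows the shape of a presence-plus-substring check and assumes a case-sensitive match.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass
class Score:
    """Hypothetical container for one score attached to a trace."""

    name: str
    value: float
    comment: str = ""


def score_final_result(final_result: str | None, expected_substring: str) -> Score:
    """Return 1.0 when the final result exists and contains the expected text."""
    # A missing or empty final result fails the check outright.
    if not final_result:
        return Score("final_result", 0.0, "final result is missing or empty")
    # Assumption: the match is a plain, case-sensitive substring test.
    if expected_substring in final_result:
        return Score("final_result", 1.0, "expected text found")
    return Score("final_result", 0.0, f"expected text {expected_substring!r} not found")
```

A score of 0.0 with a comment keeps failures distinguishable in the UI (missing result versus wrong content) without needing a second score.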

This is how those scores are displayed in the Langfuse UI:
Trace detail page:

Dashboard page:

Clickup Ticket(s): NA

Type of Change

🐛 Bug fix (non-breaking change that fixes an issue)
✨ New feature (non-breaking change that adds functionality)
💥 Breaking change (fix or feature that would cause existing functionality to not work as expected)
📝 Documentation update
🔧 Refactoring (no functional changes)
⚡ Performance improvement
🧪 Test improvements
🔒 Security fix

Changes Made

Small refactor to split report generation evaluations into online and offline
Adding a function to upload scores for final result against a string match
Adding functions to langfuse.py to upload scores on latency, token count and cost for a trace
Adding the function calls to send evaluations on each run of the demo UI for the report generation agent
Small fix to the UI for better output formatting
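The latency, token count and cost scores listed above could be assembled along these lines. This is a sketch under stated assumptions, not the code in langfuse.py: the function names, score names, and the `send` callback are hypothetical, and the 15,000-token default only mirrors the "token threshold to 15k" commit message in this PR. The actual upload call is left to the caller so that no particular Langfuse client API is assumed.

```python
from __future__ import annotations

from typing import Callable


def usage_scores(
    latency_seconds: float,
    total_tokens: int,
    cost_usd: float,
    *,
    max_tokens: int = 15_000,  # assumption: placeholder threshold
) -> dict[str, float]:
    """Build numeric scores for one trace from its usage figures."""
    return {
        "latency_seconds": float(latency_seconds),
        "total_tokens": float(total_tokens),
        "cost_usd": float(cost_usd),
        # Binary score: 1.0 when the run stayed within the token budget.
        "within_token_budget": 1.0 if total_tokens <= max_tokens else 0.0,
    }


def upload_scores(
    trace_id: str,
    scores: dict[str, float],
    send: Callable[[str, str, float], None],
) -> int:
    """Send each score through a caller-supplied uploader; return the count sent."""
    for name, value in scores.items():
        # `send` is hypothetical: it stands in for whatever client call
        # attaches a named score to a trace.
        send(trace_id, name, value)
    return len(scores)
```

Keeping the score computation separate from the upload makes the functions reusable by other agents and easy to test without a live Langfuse project.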

Testing

Tests pass locally (uv run pytest tests/)
Type checking passes (uv run mypy <src_dir>)
Linting passes (uv run ruff check src_dir/)
Manual testing performed (describe below)

Manual testing details:

Tested the UI and checked the resulting scores in Langfuse.

Checklist

Code follows the project's style guidelines
Self-review of code completed
Documentation updated (if applicable)
No sensitive information (API keys, credentials) exposed

amrit110

Just the one comment. Otherwise looks good from my side.

amrit110 · 2026-02-17T17:47:31Z

aieng-eval-agents/aieng/agent_evals/report_generation/agent.py

@@ -37,6 +38,7 @@ def get_report_generation_agent(
    instructions: str,
    reports_output_path: Path,
    langfuse_project_name: str | None,


I think this new parameter is missing a docstring

lotif added 3 commits February 11, 2026 11:33

Adding final response evaluation and some minor improvements

071af24

Finished online scores, need to put it in a thread

cad754a

Moving the score reporting to the langfuse module for better reusabi…

3aedc8e

…lity

lotif requested review from amrit110 and fcogidi February 11, 2026 17:16

lotif and others added 11 commits February 12, 2026 11:10

Adding missing init files

aed2cf2

Using the trace fetch functions instead of making them myself

efa320e

Adjusting the token threshold to 15k

1de6a37

Adding log for every metric reported

ac7ba2c

Adding agent.py to the demo as well

0bf2351

Using init_tracing instead

1f29c7d

Merge branch 'main' into marcelo/online-eval

7d2471d

Merge branch 'main' into marcelo/online-eval

046025e

Merge branch 'main' into marcelo/online-eval

5b027c9

Adding max concurrency parameter to evaluation and updating the readm…

fd69cc3

…e file

Adding one more paragraph

a850f09

amrit110 approved these changes Feb 17, 2026

amrit110 and others added 2 commits February 17, 2026 12:49

Merge branch 'main' into marcelo/online-eval

02e6318

Adding missing docstring

c30b56f

lotif merged commit 0b2b06e into main Feb 17, 2026
3 checks passed

lotif deleted the marcelo/online-eval branch February 17, 2026 18:56

lotif merged 16 commits into main from marcelo/online-eval

2 participants
